New onset delirium prediction using machine learning and long short-term memory (LSTM) in electronic health record

Abstract Objective To develop and test an accurate deep learning model for predicting new onset delirium in hospitalized adult patients. Methods Using electronic health record (EHR) data extracted from a large academic medical center, we developed a model combining long short-term memory (LSTM) and machine learning to predict new onset delirium and compared its performance with machine-learning-only models (logistic regression, random forest, support vector machine, neural network, and LightGBM). The labels of models were confusion assessment method (CAM) assessments. We evaluated models on a hold-out dataset. We calculated Shapley additive explanations (SHAP) measures to gauge the feature impact on the model. Results A total of 331 489 CAM assessments with 896 features from 34 035 patients were included. The LightGBM model achieved the best performance (AUC 0.927 [0.924, 0.929] and F1 0.626 [0.618, 0.634]) among the machine learning models. When combined with the LSTM model, the final model’s performance improved significantly (P = .001) with AUC 0.952 [0.950, 0.955] and F1 0.759 [0.755, 0.765]. The precision value of the combined model improved from 0.497 to 0.751 with a fixed recall of 0.8. Using the mean absolute SHAP values, we identified the top 20 features, including age, heart rate, Richmond Agitation-Sedation Scale score, Morse fall risk score, pulse, respiratory rate, and level of care. Conclusion Leveraging LSTM to capture temporal trends and combining it with the LightGBM model can significantly improve the prediction of new onset delirium, providing an algorithmic basis for the subsequent development of clinical decision support tools for proactive delirium interventions.


INTRODUCTION
Delirium is an acute decline in cognitive function leading to confusion, which occurs in 29% to 65% of hospitalized older patients. [1][2][3] Patients with delirium experience a serious constellation of neuropsychiatric symptoms, resulting in higher mortality, in-hospital falls, and the need for long-term care. [4][5][6][7] The risk of mortality increases by 11% for every additional 48 h after the onset of delirium. 8 In addition, delirium is associated with continued deterioration in cognitive function, 9 as well as reduced functional status, 6 and it negatively affects mental health status (eg, depression, anxiety, and post-traumatic stress disorders). 10 It is also a known risk factor leading to new onset dementia. 11 Both conditions lead to decreased life satisfaction, 12,13 and a significant burden on patients and caregivers. 14 Prevention is considered the most effective way to control delirium, 15 and more than two-thirds of delirium cases are preventable. 16 Current detection methods rely on periodic assessments by nurses, such as the confusion assessment method (CAM). The CAM includes 4 components: (1) acute onset and fluctuating course, (2) inattention, (3) disorganized thinking, and (4) altered level of consciousness. [17][18][19] However, CAM has the following limitations. First, it cannot continuously track patient status. A common interval for CAM assessments is every 12 h for hospitalized adults, which might lead to delays in delirium recognition and proactive interventions. 20,21 Second, CAM can accurately determine the presence of delirium when it occurs, but cannot predict future states. Third, a CAM assessment requires patient participation, which interrupts sleep and is unattainable for patients who are under deep sedation. Lack of early detection remains a pressing issue that hinders healthcare providers from providing timely and effective interventions, for example, the ABCDEF Bundle. 16,22
Previous studies have attempted to apply machine learning methods to predict delirium or delirium-related diseases; however, several gaps remain when using prediction models in real clinical settings. 23 First, previous studies have primarily used data from clinical trials to develop models, which have strict criteria for patient selection and data sets that are generally more complete and smaller than typical clinical use cases. 24 For example, a recent model was developed on a dataset of 1026 patients that excluded patients with dementia. 24 However, epidemiological evidence suggests that the presence of dementia is a substantial contributor to delirium and can increase the risk of delirium by 2-5 times. 11 Therefore, the model's predictive performance in hospitalized patients is yet to be validated. Second, another previous study that used International Classification of Diseases (ICD) codes to label delirium yielded a presence rate of only 1.5%. 25 However, it has been shown that using ICD codes identifies only 18% of the total delirium cases, so such a model would leave a large number of delirious patients undetected. 26 Third, the existing delirium prediction models did not adequately account for temporal data. Most of them predicted delirium based on the features collected, with each record considered as an independent case. In clinical usage, each patient usually has multiple CAM assessments during hospitalization, and each assessment and the associated feature values should be considered as continuous data that may affect the subsequent delirium status.
The purpose of this study was to develop accurate deep learning models to predict new onset delirium in hospitalized adult patients. We proposed a method that uses an LSTM-based model to capture temporal correlations and predict delirium status based on several previous CAM assessments and feature values in a time series. For patients who did not yet have multiple CAM assessments, we used a machine learning model to predict delirium based on static data. Our study utilized a generalizable dataset that was routinely collected from Vanderbilt University Medical Center (VUMC)'s electronic health record (EHR) system for approximately 4 years. In addition, considering clinical practice, we predicted the new onset of delirium (ie, new positive CAM assessment) 27 and provided visual interpretations of the predictions. The research was conducted at VUMC and was approved by the Vanderbilt University Institutional Review Board.

Study design and population
We extracted all adult patients who had a CAM assessment between January 1, 2018 and October 1, 2021 in the intensive care unit (ICU) from VUMC's clinical data warehouse. We excluded CAM assessments performed less than 12 h after the time of arrival on the unit and CAM assessments after new onset delirium. At VUMC, nurses conducted routine CAM assessments to assess delirium status in the ICU. The prediction label was based on the result of the CAM assessment (ie, positive or negative). Diagnosis of delirium using CAM requires the presence of feature 1 (acute onset or fluctuating course) and feature 2 (inattention) and either feature 3 (disorganized thinking) or feature 4 (altered level of consciousness). 27

Data collection and preprocessing
The goal of our study is to predict delirium before it occurs. We assessed 3 time windows: 6, 12, and 24 h before the onset of the delirium event. For each time window, we collected the latest values from model features generated at least that many hours before the next CAM assessment. For example, when the time window was 6 h, we only considered data at least 6 h prior to the CAM assessment. We collected 896 features from the following EHR data domains: medications, vital signs, laboratory values, active problems, historical problems, type of surgery, social history, procedures, and hospital admission. For each feature, we calculated the missing rate in the training dataset and removed features with missing rates >0.99. We used the Clinical Classifications Software to map diagnosis codes into categories. 28 The preprocessing process consisted of 3 steps: (1) imputation of missing values, (2) scaling, and (3) encoding categorical features. Categorical features were reported as counts and percentages. Numerical features were reported as mean with standard deviation (SD) and median with interquartile range (IQR).
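To make the time-window rule above concrete, the following minimal sketch selects, for each feature, the latest value recorded at least a given number of hours before a CAM assessment. The function name, the tuple layout of the observations, and the sample values are hypothetical and chosen only for illustration; they are not taken from the study's pipeline.

```python
from datetime import datetime, timedelta


def latest_values_before(observations, cam_time, window_hours):
    """Return {feature: value} using the most recent observation of each
    feature recorded at least `window_hours` before `cam_time`.

    `observations` is an iterable of (feature_name, timestamp, value)
    tuples (a hypothetical layout chosen for this example).
    """
    cutoff = cam_time - timedelta(hours=window_hours)
    latest = {}  # feature name -> (timestamp, value)
    for name, ts, value in observations:
        if ts > cutoff:
            continue  # recorded too close to the assessment, so excluded
        if name not in latest or ts > latest[name][0]:
            latest[name] = (ts, value)
    return {name: value for name, (_, value) in latest.items()}


# Hypothetical example data for a CAM assessment at 20:00.
cam_time = datetime(2021, 3, 1, 20, 0)
observations = [
    ("heart_rate", datetime(2021, 3, 1, 8, 0), 88),
    ("heart_rate", datetime(2021, 3, 1, 13, 30), 95),
    ("heart_rate", datetime(2021, 3, 1, 18, 0), 110),  # inside the 6 h window
    ("rass", datetime(2021, 3, 1, 7, 0), -1),
]
features_6h = latest_values_before(observations, cam_time, 6)
features_24h = latest_values_before(observations, cam_time, 24)
```

With the 6 h window, the 18:00 heart rate is ignored and the 13:30 value is used; with the 24 h window, no observation in this toy example is old enough, so the result is empty and would be handled by the imputation step.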

Machine learning model development and evaluation
We split the dataset at the patient level into a training dataset (80%) and a testing dataset (20%). The testing dataset was used as a hold-out dataset for external validation. We used 5-fold cross-validation on the training dataset to tune the model hyperparameters. After obtaining the optimal hyperparameters, we developed models using the training dataset, then performed 1000-round bootstrapping with the hold-out testing dataset to report the results. We developed logistic regression, random forest, support vector machine, and LightGBM 29 models. Gradient boosting decision tree models have been applied to other clinical tasks with excellent performance compared to traditional machine learning algorithms. [30][31][32] We predicted the risk of new onset delirium within 6, 12, and 24 h, respectively. We reported outcomes in F1, accuracy, area under the receiver operating characteristic curve (AUC), recall, and precision. To evaluate the overall performance, we plotted receiver-operating characteristic curves and precision-recall curves. The receiver-operating characteristic curve plots sensitivity against (1-specificity) across classification thresholds. Models with a larger AUC are considered to have better performance. The precision-recall curve, in turn, illustrates the trade-off between recall and precision. Models with high performance tend to have a balance of high recall and precision, yielding large F1 values. Machine learning model development and evaluation were done using the following packages: numpy, pandas, matplotlib, sklearn, and lightgbm.
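The patient-level split and the bootstrapped evaluation described above can be sketched in plain Python as follows. This is a minimal illustration rather than the study's code: the function names are hypothetical, and the use of a percentile interval is an assumption, since the text does not state how the intervals were derived.

```python
import random


def split_by_patient(patient_ids, test_fraction=0.2, seed=0):
    """Assign whole patients to train or test so no patient appears in both."""
    unique_ids = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    n_test = round(len(unique_ids) * test_fraction)
    test_ids = set(unique_ids[:n_test])
    train_idx = [i for i, p in enumerate(patient_ids) if p not in test_ids]
    test_idx = [i for i, p in enumerate(patient_ids) if p in test_ids]
    return train_idx, test_idx


def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = positive CAM)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def bootstrap_f1_interval(y_true, y_pred, rounds=1000, seed=0):
    """Percentile 95% interval of F1 from resampling with replacement
    (the percentile method is an assumption made for this sketch)."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(rounds):
        idx = [rng.randrange(n) for _ in range(n)]
        _, _, f1 = precision_recall_f1([y_true[i] for i in idx],
                                       [y_pred[i] for i in idx])
        scores.append(f1)
    scores.sort()
    return scores[int(0.025 * rounds)], scores[int(0.975 * rounds) - 1]


# Hypothetical toy data: one row per CAM assessment.
patient_ids = ["a", "a", "b", "c", "c", "c", "d", "e", "e", "e"]
train_idx, test_idx = split_by_patient(patient_ids)
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
low, high = bootstrap_f1_interval(y_true, y_pred, rounds=200)
```

Splitting on patient identifiers rather than on rows keeps all assessments from one patient on the same side of the split, which avoids leakage between training and testing data.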

Statistical analysis
To compare the characteristics between patients with and without delirium in the cohort, we performed Welch t tests for numerical features and Chi-square tests for categorical features. To compare the performance of different models, we conducted a Friedman test 33 on F1 values with a follow-up Nemenyi test 34 for pairwise comparisons. 35 P<.05 was considered to be statistically significant.
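For reference, the Friedman statistic in its standard form (without tie correction) is given below. How the blocks were defined in this study is not stated; treating each bootstrap round as one block is an assumption made here for illustration. With $n$ blocks, $k$ models, and $R_j$ the sum of the within-block ranks of model $j$:

```latex
\chi^2_F = \frac{12}{n\,k\,(k+1)} \sum_{j=1}^{k} R_j^{2} \;-\; 3\,n\,(k+1)
```

Under the null hypothesis that all models perform equally, $\chi^2_F$ approximately follows a chi-square distribution with $k-1$ degrees of freedom; a significant result is then followed by pairwise post-hoc comparisons such as the Nemenyi test.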

Model explainability
We calculated Shapley additive explanations (SHAP) values 36 for each feature and applied the SHAP framework to interpret each prediction on the hold-out set. SHAP values are intended to explain complex "black-box" machine learning models, for example, neural networks and gradient boosting tree-based models. 36 The SHAP framework provides a unique solution with important properties (local accuracy, missingness, and consistency) based on additive feature attribution methods and game theory. A SHAP value is calculated by comparing the difference in predictions across all possible feature combinations that include and withhold each feature. It shows better consistency and accuracy with respect to human intuition than previous approaches to model interpretation.
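The comparison across feature combinations described above corresponds to the classic Shapley value. As a reference sketch in the notation commonly used for SHAP (the symbols are not defined in the text above and are introduced here for illustration): $F$ is the full feature set, $S$ a subset that excludes feature $i$, and $f_S$ the model evaluated using only the features in $S$.

```latex
\phi_i = \sum_{S \subseteq F \setminus \{i\}}
  \frac{|S|!\,\bigl(|F| - |S| - 1\bigr)!}{|F|!}
  \Bigl[ f_{S \cup \{i\}}\bigl(x_{S \cup \{i\}}\bigr) - f_S\bigl(x_S\bigr) \Bigr],
\qquad
f(x) = \phi_0 + \sum_{i=1}^{|F|} \phi_i
```

The second identity is the local accuracy property: the base value $\phi_0$ plus the per-feature attributions sums to the model output for the explained instance.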

Machine learning and LSTM combined model
For data preprocessing, we used the same training and testing datasets as in the previous section on machine learning, with the data partitioned at the patient level such that our testing and training sets included nonoverlapping subsets of patients. The deep learning model includes 2 parts: (1) a fixed-length LSTM-based model and (2) a machine learning model. To develop the LSTM-based model, we selected encounters in the training set with at least 4 CAM assessments, a threshold based on the median number of CAM assessments per hospitalization in our dataset (4). To capture temporal relationships we chose the LSTM method, a state-of-the-art deep learning model designed to analyze sequential data. 37 In the LSTM-based model, we developed embedding layers to convert each categorical feature into a 2-dimensional dense real-valued vector (ℝ²). Numerical features were imputed using the mean value, transformed using a standard scaler, and connected to the embedded vectors via a concatenation layer. In addition to the 4 previous CAM assessments and associated features, we developed a neural network to integrate into the LSTM model the most recent features generated at least 6 h prior to prediction through another concatenation layer. Our proposed LSTM-based model is shown in Figure 1. The units of the LSTM layer and the dense layer, the learning rate, and the dropout rate were tuned using the Hyperband Tuner in Keras, an efficient hyperparameter optimization approach widely used in deep learning. 38 The model was trained using an Adam optimizer and a binary cross-entropy loss function. For the second part, to predict the delirium status at the first 4 assessments during each hospitalization in the testing dataset, we used the optimal machine learning model developed in the previous section.
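The two-part design above amounts to a routing rule: use the static machine learning model until 4 prior CAM assessments are available, then use the sequence model on the 4 most recent ones. The sketch below illustrates only that routing; `predict_risk`, `static_stub`, and `sequence_stub` are hypothetical names, and the stubs return fixed numbers in place of the trained LightGBM and LSTM-based models.

```python
SEQUENCE_LENGTH = 4  # number of prior CAM assessments fed to the sequence model


def predict_risk(prior_assessments, latest_features, static_model, sequence_model):
    """Route a prediction to one of two models based on available history.

    prior_assessments: list of per-assessment feature dicts, oldest first.
    static_model / sequence_model: hypothetical callables returning a risk
    between 0 and 1; they stand in for the trained models.
    """
    if len(prior_assessments) < SEQUENCE_LENGTH:
        # Not enough history for a fixed-length sequence.
        return static_model(latest_features)
    window = prior_assessments[-SEQUENCE_LENGTH:]  # 4 most recent assessments
    return sequence_model(window, latest_features)


# Stand-in models that return fixed values, used only to show the routing.
def static_stub(latest_features):
    return 0.10


def sequence_stub(window, latest_features):
    assert len(window) == SEQUENCE_LENGTH
    return 0.80


history = [{"rass": 0}, {"rass": -1}, {"rass": -1}, {"rass": -2}, {"rass": -3}]
early_risk = predict_risk(history[:2], {"age": 71}, static_stub, sequence_stub)
later_risk = predict_risk(history, {"age": 71}, static_stub, sequence_stub)
```

Keeping the routing separate from the models means either model can be retrained or replaced without changing how predictions are dispatched.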

Patient characteristics
A total of 331 489 CAM assessments from 34 035 patients with 39 567 encounters were included in the final dataset. The characteristics of patients are listed in Table 1. The median age of patients was 59 years (IQR [44, 70]). A total of 37 246 CAM assessments were positive (11.2%). Patients in the delirium group were older, mostly had public insurance, and had a longer length of stay (P<.001). Race and sex were not significantly different between the 2 groups. We extracted 896 features: medications (195), vital signs (10), laboratory values (39), active problems (161), historical problems (84), type of surgery (108), social history (5), procedures (279), and hospital admission (13). Examples of features are presented in Table 2. The median number of CAM assessments in each hospitalization was 4 (IQR [2, 9]).

Machine learning model performance
The metrics of the LightGBM models for the 3 time windows are presented in Table 4. In the Friedman test, all metrics identified significant differences between the LightGBM models for the 3 different time windows. In the Nemenyi post-hoc test, the LightGBM 6 h model had significantly higher specificity, precision, accuracy, F1, and AUC than the other 2 models (P = .001).
The recall and NPV of the LightGBM 6 h model were not significantly different from those of the LightGBM 12 h model, but were significantly higher than those of the LightGBM 24 h model. The AUCs and the precision-recall curves are shown in Figure 2. The precision-recall curves of the LightGBM 6 h model and the LightGBM 12 h model were close; however, when the recall was fixed at a large value, the precision of the LightGBM 12 h model was much smaller than that of the LightGBM 6 h model. This suggests that the LightGBM model has a more robust performance in predicting new onset delirium within 6 h.

Model explainability
Using the mean absolute SHAP values, we determined the top 20 features including age, heart rate, Richmond Agitation-Sedation Scale (RASS) score, fall risk, pulse, respiratory rate, level of care, and the number of previous CAM assessments in this encounter. We also identified 3 laboratory values (ammonia level, lactate blood test, and pO2 venous), 3 medications (intravenous anesthesia, atypical antipsychotics, and opioid analgesic anesthetic adjunct agents), and 6 procedures (eg, CT head without contrast and portable X-rays anteroposterior chest) as important features for predicting new onset delirium. Figure 3 presents the relationships between feature values and their effect on the model output. Furthermore, we provided scatter plots (Figure 4) for several important identified features (eg, ammonia level, RASS score, and age). The gray histogram shows the distribution of values. For continuous features, we also added yellow lines to represent regression lines. For example, when ammonia levels exceeded the normal range (15-45 µg/dL), SHAP values increased and the effect on the model results became larger. For the RASS score, the effect on the prediction model was minimal when the patient was alert and calm (RASS score = 0). In addition, negative RASS scores had a greater impact on model prediction compared to positive RASS scores. For age, we observed that the SHAP value increased as the patient's age increased.
In addition to the overall effect, we applied the SHAP framework to explain individual cases by providing influential features. Figure 5 shows 2 examples: a negative prediction (top) and a positive prediction (bottom). Features in blue contribute to a lower risk, while features in red push up the risk. These visualizations give users detailed information about how the model makes predictions and allow them to make appropriate interventions before the new onset delirium.

Machine learning and LSTM combined model performance
See Supplementary Table S1. Because we wanted to be able to make predictions for patients with fewer than 4 prior CAM assessments, we created a final model (LightGBM+LSTM) which used the LightGBM model for predictions where there were fewer than 4 prior CAM scores, and then switched to the more accurate LSTM once at least 4 scores had been recorded. The combined LightGBM+LSTM model had an AUC score of 0.952 [0.950, 0.955] and an F1 score of 0.759 [0.755, 0.765]. Other metrics are reported in Table 5. In the Friedman test, all metrics from the combined model were significantly different from the original LightGBM model (P <.001). In the Nemenyi post-hoc test, the F1 and AUC scores were significantly higher for the combined model (P = .001).
The AUCs and the precision-recall curves are shown in Figure 6. The AUCs of the LightGBM model and the combined model were similar; however, the precision-recall curves were different. When the recall was set to 0.8, the precision value of the combined model increased from 0.497 to 0.751 compared to the LightGBM model, an increase of 51%. The increments in precision values for other fixed recall values are reported in Table 6.
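The comparison at a fixed recall can be illustrated with a short sketch. The helper below finds the highest score threshold that reaches a target recall and reports the precision at that threshold; its name, the threshold rule, and the toy labels and scores are assumptions for illustration, not the study's evaluation code. The last lines check the relative increase using the two precision values reported above.

```python
def precision_at_recall(y_true, scores, target_recall):
    """Return (precision, threshold) at the highest threshold whose recall
    is at least `target_recall`. Predictions are positive when score >= threshold.
    """
    positives = sum(y_true)
    if positives == 0:
        raise ValueError("y_true contains no positive labels")
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p)
        fp = sum(1 for t, p in zip(y_true, predicted) if t == 0 and p)
        if tp / positives >= target_recall:
            return tp / (tp + fp), threshold
    raise ValueError("target_recall must be at most 1")


# Hypothetical toy labels and risk scores.
y_true = [1, 1, 0, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
precision, threshold = precision_at_recall(y_true, scores, 0.75)

# Relative increase in precision at recall 0.8, using the reported values.
lightgbm_precision = 0.497
combined_precision = 0.751
relative_increase = (combined_precision - lightgbm_precision) / lightgbm_precision
```

In the toy data, a recall of 0.75 is first reached at a threshold of 0.6, where 3 of the 4 flagged assessments are true positives. For the reported values, the relative increase is about 0.51, matching the 51% stated above.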

Principal findings
In this study, we developed a novel LSTM and LightGBM combined model to predict new onset delirium and evaluated the predictive capability of the model using EHR data generated directly from routine healthcare activities. This algorithm has better performance than the traditional machine learning model. It has the potential to be implemented as a clinical decision support (CDS) tool integrated into an EHR system. This means healthcare providers can obtain high-performance risk assessments between manual CAM assessments and be able to provide proactive and timely interventions for high-risk patients. The important features found in the LightGBM model are supported by clinical evidence. For example, a systematic review reported that elevated levels of ammonia were associated with severe hepatic encephalopathy, 39 a cause of delirium. 40 The 3 medications found (injectable anesthesia, antipsychotic atypical dopamine serotonin antagonist, and opioid analgesic anesthetic adjunct agents) are also mentioned in other studies. 1,41 Previous evidence suggests that older patients are at higher risk of delirium when exposed to anesthetics 1 and higher risk of postoperative delirium when exposed to psychoactive drugs (benzodiazepines, opioids). 41 Respiration rates and RASS scores have also been identified as significant predictors in previous prediction models. 24 Although imaging is part of the predictive model, we are aware that imaging decisions may be based on clinical suspicion and/or protocolized care (eg, chest X-ray to evaluate endotracheal tube position). Therefore, the interpretation of imaging data would require clinical correlation. In the model explanation, we observed that negative RASS scores had a greater impact on model predictions than positive RASS scores, suggesting that our model may be more capable of predicting hypoactive delirium.
Hypoactive delirium is an important subtype of delirium that is usually more common than hyperactive delirium. [42][43][44] In addition, long durations of hypoactive delirium lead to long-term cognitive decline. 45 However, because the patient exhibits fewer behavioral problems, 46,47 it is often difficult to detect, resulting in underreporting. 44 In this study, we also developed other Bidirectional Encoder Representations from Transformers (BERT)-based models for analyzing clinical notes and found that clinical notes, neither alone nor in combination with structured data, could achieve higher performance in predicting new onset delirium.
We found that using an LSTM-based model to treat historical CAM assessments and associated features as longitudinal data can substantially improve predictive performance. This indicates that the trajectory of historical data may also be informative in predicting delirium. This finding is consistent with other disease predictions, for example, heart disease. 48 In addition, we combined LSTM with machine learning to provide predictions at the beginning of the time series, which was often ignored by previous time series studies of healthcare data. We also found that using the time interval of CAM assessments as the interval of the timestamps to integrate features is feasible in providing accurate predictions. Specifically, for each time point in our time series corresponding to a CAM assessment, features were selected from data generated at least 6 h prior to that CAM assessment. Previous studies typically aggregate data on an hourly basis, potentially generating more noise and imposing higher demands on model training. We identified a recent study that developed an LSTM-based model to predict delirium status at least 24 h after hospitalization based on 21 features. 49 Our study used a more extensive set of 896 features, while the machine learning part we introduced in the combined model could provide predictions when there was not enough historical data to run the LSTM-based model. In addition, as a critical step for implementation in the clinic, the performance of the prediction model should be considered. In the reported model, the maximum AUC was 88.39% with a precision and recall of 37.52% and 86.18%, respectively; that is, only about 38 out of 100 predicted delirium cases would actually occur, which would place an additional burden on health providers, especially in the ICU environment.

Limitations
This study has several limitations. First, we developed models based on a dataset from a single medical center. Exploring the predictability of this model on other healthcare systems might add more value. However, it should be noted that the dataset was extracted from a large tertiary referral center with a broad catchment area. In addition, we used a hold-out testing dataset containing different patients for external validation. Second, as a retrospective study, the impact of predicting new onset delirium on patient outcomes is still unknown.

Future work
Future work in this area should link delirium prediction with evidence-based actions through clinical decision support formats. This includes designing interactive interfaces, exploring better presentations to explain model behavior based on clinician needs, implementing the model in the workflow, and further exploring its impact on clinician behavior as well as patient outcomes. Another direction is to predict different types of delirium (ie, hypoactive delirium, hyperactive delirium, and mixed delirium) and to provide clinicians with corresponding actionable interventions for each type through CDS tools.

CONCLUSION
Delirium remains a serious risk factor for older patients in the ICU and is one of the key directions for aging research. Early detection of new onset delirium in the clinical workflow is a critical step to enhancing patient monitoring and improving patient outcomes. We developed a deep learning prediction model for new onset delirium within 6 h using data generated directly from the EHR. The LSTM layer inside the model could capture the temporal relationships in historical data. This new model has excellent performance in predicting new onset delirium, which provides a solid technical basis for the intelligent CDS tool for delirium prediction in a future implementation study.

FUNDING
This work was supported by NIH grants R01AG062499-01 and K99LM014097-01.

AUTHOR CONTRIBUTIONS
SL conducted feature identification, data extraction, model development, and statistical analysis, and drafted the manuscript. SL, AM, JS, AW, BS, TR, TK, and ER helped to design experiments and revise the drafted manuscript. SL and TK performed a literature review. All authors approved the submitted version.

SUPPLEMENTARY MATERIAL
Supplementary material is available at Journal of the American Medical Informatics Association online.